{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "9e161a8a-fcf0-4d55-933e-da271ce28d7e",
      "metadata": {},
      "source": [
        "# How to handle long text\n",
        "\n",
        ":::info Prerequisites\n",
        "\n",
        "This guide assumes familiarity with the following:\n",
        "\n",
        "- [Extraction](/docs/tutorials/extraction)\n",
        "\n",
        ":::\n",
        "\n",
        "When working with files such as PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:\n",
        "\n",
        "1. **Change LLM**: Choose a different LLM that supports a larger context window.\n",
        "2. **Brute Force**: Chunk the document, and extract content from each chunk.\n",
        "3. **RAG**: Chunk the document, index the chunks, and only extract content from a subset of chunks that look \"relevant\".\n",
        "\n",
        "Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "57969139-ad0a-487e-97d8-cb30e2af9742",
      "metadata": {},
      "source": [
        "## Set up\n",
        "\n",
        "First, let's install some required dependencies:\n",
        "\n",
        "```{=mdx}\n",
        "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n",
        "import Npm2Yarn from \"@theme/Npm2Yarn\";\n",
        "\n",
        "<IntegrationInstallTooltip></IntegrationInstallTooltip>\n",
        "\n",
        "<Npm2Yarn>\n",
        "  langchain @langchain/openai @langchain/community @langchain/core zod cheerio\n",
        "</Npm2Yarn>\n",
        "```\n",
        "\n",
        "Next, we need some example data! Let's download an article about [cars from Wikipedia](https://en.wikipedia.org/wiki/Car) and load it as a LangChain `Document`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "571aad22-2cec-4b9b-b656-5e4b81a1ec6c",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "\u001b[33m97336\u001b[39m"
            ]
          },
          "execution_count": 1,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import { CheerioWebBaseLoader } from \"@langchain/community/document_loaders/web/cheerio\";\n",
        "// Only required in a Deno notebook environment to load the peer dep.\n",
        "import \"cheerio\";\n",
        "\n",
        "const loader = new CheerioWebBaseLoader(\n",
        "  \"https://en.wikipedia.org/wiki/Car\"\n",
        ");\n",
        "\n",
        "const docs = await loader.load();\n",
        "\n",
        "docs[0].pageContent.length;"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "af3ffb8d-587a-4370-886a-e56e617bcb9c",
      "metadata": {},
      "source": [
        "## Define the schema\n",
        "\n",
        "Here, we'll define a schema to extract key developments from the text."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "a3b288ed-87a6-4af0-aac8-20921dc370d4",
      "metadata": {},
      "outputs": [],
      "source": [
        "import { z } from \"zod\";\n",
        "import { ChatPromptTemplate } from \"@langchain/core/prompts\";\n",
        "import { ChatOpenAI } from \"@langchain/openai\";\n",
        "\n",
        "const keyDevelopmentSchema = z.object({\n",
        "  year: z.number().describe(\"The year when there was an important historic development.\"),\n",
        "  description: z.string().describe(\"What happened in this year? What was the development?\"),\n",
        "  evidence: z.string().describe(\"Repeat verbatim the sentence(s) from which the year and description information were extracted\"),\n",
        "}).describe(\"Information about a development in the history of cars.\");\n",
        "\n",
        "const extractionDataSchema = z.object({\n",
        "  key_developments: z.array(keyDevelopmentSchema),\n",
        "}).describe(\"Extracted information about key developments in the history of cars\");\n",
        "\n",
        "const SYSTEM_PROMPT_TEMPLATE = [\n",
        "  \"You are an expert at identifying key historic developments in text.\",\n",
        "  \"Only extract important historic developments. Extract nothing if no important information can be found in the text.\"\n",
        "].join(\"\\n\");\n",
        "\n",
        "// Define a custom prompt to provide instructions and any additional context.\n",
        "// 1) You can add examples into the prompt template to improve extraction quality\n",
        "// 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
        "//    about the document from which the text was extracted.)\n",
        "const prompt = ChatPromptTemplate.fromMessages([\n",
        "  [\n",
        "    \"system\",\n",
        "    SYSTEM_PROMPT_TEMPLATE,\n",
        "  ],\n",
        "  // Keep on reading through this use case to see how to use examples to improve performance\n",
        "  // MessagesPlaceholder('examples'),\n",
        "  [\"human\", \"{text}\"],\n",
        "]);\n",
        "\n",
        "// We will be using tool calling mode, which\n",
        "// requires a tool calling capable model.\n",
        "const llm = new ChatOpenAI({\n",
        "  model: \"gpt-4-0125-preview\",\n",
        "  temperature: 0,\n",
        "});\n",
        "\n",
        "const extractionChain = prompt.pipe(llm.withStructuredOutput(extractionDataSchema));"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "13aebafb-26b5-42b2-ae8e-9c05cd56e5c5",
      "metadata": {},
      "source": [
        "## Brute force approach\n",
        "\n",
        "Split the documents into chunks such that each chunk fits into the LLM's context window."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "27b8a373-14b3-45ea-8bf5-9749122ad927",
      "metadata": {},
      "outputs": [],
      "source": [
        "import { TokenTextSplitter } from \"langchain/text_splitter\";\n",
        "\n",
        "const textSplitter = new TokenTextSplitter({\n",
        "  chunkSize: 2000,\n",
        "  chunkOverlap: 20,\n",
        "});\n",
        "\n",
        "// Note that this method takes an array of docs\n",
        "const splitDocs = await textSplitter.splitDocuments(docs);"
      ]
    },
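    {
      "cell_type": "markdown",
      "id": "chunk-overlap-sketch",
      "metadata": {},
      "source": [
        "To make `chunkSize` and `chunkOverlap` concrete, here is a minimal sketch of overlapping chunking. It is not how `TokenTextSplitter` works internally: it counts whitespace-separated words as a stand-in for model tokens, and `chunkWords` is a hypothetical helper name.\n",
        "\n",
        "```typescript\n",
        "// Hypothetical helper: split text into windows of `chunkSize` words,\n",
        "// where consecutive windows share `chunkOverlap` words.\n",
        "function chunkWords(text: string, chunkSize: number, chunkOverlap: number): string[] {\n",
        "  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {\n",
        "    throw new Error(\"Expected 0 <= chunkOverlap < chunkSize\");\n",
        "  }\n",
        "  const words = text.split(/\\s+/).filter((word) => word.length > 0);\n",
        "  // Each new window starts this many words after the previous one.\n",
        "  const step = chunkSize - chunkOverlap;\n",
        "  const chunks: string[] = [];\n",
        "  for (let start = 0; start < words.length; start += step) {\n",
        "    chunks.push(words.slice(start, start + chunkSize).join(\" \"));\n",
        "    // Stop once a window has reached the end of the text.\n",
        "    if (start + chunkSize >= words.length) break;\n",
        "  }\n",
        "  return chunks;\n",
        "}\n",
        "\n",
        "const exampleChunks = chunkWords(\"a b c d e f g h\", 4, 1);\n",
        "```\n",
        "\n",
        "Here `exampleChunks` holds three chunks, and each chunk begins with the last word of the previous one. A small overlap like this reduces the chance that a sentence is cut at a chunk boundary with no surrounding context."
      ]
    },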
    {
      "cell_type": "markdown",
      "id": "5b43d7e0-3c85-4d97-86c7-e8c984b60b0a",
      "metadata": {},
      "source": [
        "Use the `.batch` method present on all runnables to run the extraction in **parallel** across the chunks!\n",
        "\n",
        ":::{.callout-tip}\n",
        "You can often use `.batch()` to parallelize the extractions!\n",
        "\n",
        "If your model is exposed via an API, this will likely speed up your extraction flow.\n",
        ":::"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "6ba766b5-8d6c-48e6-8d69-f391a66b65d2",
      "metadata": {},
      "outputs": [],
      "source": [
        "// Limit just to the first 3 chunks\n",
        "// so the code can be re-run quickly\n",
        "const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);\n",
        "\n",
        "const extractionChainParams = firstFewTexts.map((text) => {\n",
        "  return { text };\n",
        "});\n",
        "\n",
        "const results = await extractionChain.batch(extractionChainParams, { maxConcurrency: 5 });"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "67da8904-e927-406b-a439-2a16f6087ccf",
      "metadata": {},
      "source": [
        "### Merge results\n",
        "\n",
        "After extracting data from across the chunks, we'll want to merge the extractions together."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "30b35897-4d94-44ad-80c6-446eff61b76b",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[\n",
              "  { year: \u001b[33m0\u001b[39m, description: \u001b[32m\"\"\u001b[39m, evidence: \u001b[32m\"\"\u001b[39m },\n",
              "  {\n",
              "    year: \u001b[33m1769\u001b[39m,\n",
              "    description: \u001b[32m\"French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.\"\u001b[39m,\n",
              "    evidence: \u001b[32m\"French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.\"\u001b[39m\n",
              "  },\n",
              "  {\n",
              "    year: \u001b[33m1808\u001b[39m,\n",
              "    description: \u001b[32m\"French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu\"\u001b[39m... 25 more characters,\n",
              "    evidence: \u001b[32m\"French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu\"\u001b[39m... 33 more characters\n",
              "  },\n",
              "  {\n",
              "    year: \u001b[33m1886\u001b[39m,\n",
              "    description: \u001b[32m\"German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,\"\u001b[39m... 40 more characters,\n",
              "    evidence: \u001b[32m\"The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German\"\u001b[39m... 56 more characters\n",
              "  },\n",
              "  {\n",
              "    year: \u001b[33m1908\u001b[39m,\n",
              "    description: \u001b[32m\"The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca\"\u001b[39m... 28 more characters,\n",
              "    evidence: \u001b[32m\"One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by\"\u001b[39m... 24 more characters\n",
              "  }\n",
              "]"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "const keyDevelopments = results.flatMap((result) => result.key_developments);\n",
        "\n",
        "keyDevelopments.slice(0, 20);"
      ]
    },
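    {
      "cell_type": "markdown",
      "id": "merge-cleanup-sketch",
      "metadata": {},
      "source": [
        "The merged list can contain empty placeholder entries (such as the `year: 0` item in the output above) and, when chunks overlap, repeated items. Here is a minimal clean-up sketch, assuming that two items with the same year and the same evidence text are duplicates. `cleanDevelopments` is a hypothetical helper, and the sample data is made up for illustration.\n",
        "\n",
        "```typescript\n",
        "interface KeyDevelopment {\n",
        "  year: number;\n",
        "  description: string;\n",
        "  evidence: string;\n",
        "}\n",
        "\n",
        "// Hypothetical helper: drops entries without a description, then keeps\n",
        "// the first occurrence of each (year, evidence) pair.\n",
        "function cleanDevelopments(items: KeyDevelopment[]): KeyDevelopment[] {\n",
        "  const seen = new Set<string>();\n",
        "  const cleaned: KeyDevelopment[] = [];\n",
        "  for (const item of items) {\n",
        "    if (item.description.trim() === \"\") continue;\n",
        "    // Normalize case and whitespace so trivial differences still match.\n",
        "    const evidence = item.evidence.replace(/\\s+/g, \" \").trim().toLowerCase();\n",
        "    const key = `${item.year}|${evidence}`;\n",
        "    if (seen.has(key)) continue;\n",
        "    seen.add(key);\n",
        "    cleaned.push(item);\n",
        "  }\n",
        "  return cleaned;\n",
        "}\n",
        "\n",
        "// Made-up sample data, not real model output.\n",
        "const sampleDevelopments: KeyDevelopment[] = [\n",
        "  { year: 0, description: \"\", evidence: \"\" },\n",
        "  { year: 1769, description: \"First steam-powered road vehicle.\", evidence: \"Sentence A.\" },\n",
        "  { year: 1769, description: \"Cugnot builds a steam vehicle.\", evidence: \"sentence  a.\" },\n",
        "  { year: 1886, description: \"Benz patents the Motorwagen.\", evidence: \"Sentence B.\" },\n",
        "];\n",
        "const cleanedDevelopments = cleanDevelopments(sampleDevelopments);\n",
        "```\n",
        "\n",
        "In this guide, the same helper could be applied as `cleanDevelopments(keyDevelopments)`. Keying on the evidence rather than the description is a design choice: the model may word the description differently each time while quoting the same sentence."
      ]
    },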
    {
      "cell_type": "markdown",
      "id": "48afd4a7-abcd-48b4-8ff1-6ca485f529e3",
      "metadata": {},
      "source": [
        "## RAG-based approach\n",
        "\n",
        "Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the most relevant chunks.\n",
        "\n",
        ":::{.callout-caution}\n",
        "It can be difficult to identify which chunks are relevant.\n",
        "\n",
        "For example, in the `car` article we're using here, most of the article contains key development information. So by using\n",
        "**RAG**, we'll likely be throwing out a lot of relevant information.\n",
        "\n",
        "We suggest experimenting with your use case and determining whether this approach works or not.\n",
        ":::\n",
        "\n",
        "Here's a simple example that relies on `MemoryVectorStore`, an in-memory vector store suited to demos."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "aaf37c82-625b-4fa1-8e88-73303f08ac16",
      "metadata": {},
      "outputs": [],
      "source": [
        "import { MemoryVectorStore } from \"langchain/vectorstores/memory\";\n",
        "import { OpenAIEmbeddings } from \"@langchain/openai\";\n",
        "\n",
        "// Only load the first 10 docs for speed in this demo use-case\n",
        "const vectorstore = await MemoryVectorStore.fromDocuments(\n",
        "  splitDocs.slice(0, 10),\n",
        "  new OpenAIEmbeddings()\n",
        ");\n",
        "\n",
        "// Only extract from top document\n",
        "const retriever = vectorstore.asRetriever({ k: 1 });"
      ]
    },
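    {
      "cell_type": "markdown",
      "id": "top-k-similarity-sketch",
      "metadata": {},
      "source": [
        "Conceptually, a retriever with `k: 1` embeds the query, scores every stored chunk against it, and returns the best match. Here is a minimal sketch of that ranking step, assuming cosine similarity and using made-up three-dimensional vectors in place of real embeddings. `cosineSimilarity` and `topK` are hypothetical helpers, not LangChain APIs.\n",
        "\n",
        "```typescript\n",
        "// Cosine similarity between two vectors of equal length.\n",
        "function cosineSimilarity(a: number[], b: number[]): number {\n",
        "  let dot = 0;\n",
        "  let normA = 0;\n",
        "  let normB = 0;\n",
        "  for (let i = 0; i < a.length; i++) {\n",
        "    dot += a[i] * b[i];\n",
        "    normA += a[i] * a[i];\n",
        "    normB += b[i] * b[i];\n",
        "  }\n",
        "  return dot / (Math.sqrt(normA) * Math.sqrt(normB));\n",
        "}\n",
        "\n",
        "// Returns the indices of the `k` vectors most similar to the query.\n",
        "function topK(query: number[], vectors: number[][], k: number): number[] {\n",
        "  return vectors\n",
        "    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))\n",
        "    .sort((left, right) => right.score - left.score)\n",
        "    .slice(0, k)\n",
        "    .map((entry) => entry.index);\n",
        "}\n",
        "\n",
        "// Toy vectors standing in for chunk embeddings.\n",
        "const toyVectors = [\n",
        "  [1, 0, 0],\n",
        "  [0, 1, 0],\n",
        "  [0.9, 0.1, 0],\n",
        "];\n",
        "const nearest = topK([1, 0, 0], toyVectors, 2);\n",
        "```\n",
        "\n",
        "With these toy vectors, `nearest` contains indices `0` and `2`, in that order. Raising `k` trades extra LLM calls or longer prompts for a lower chance of discarding a relevant chunk."
      ]
    },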
    {
      "cell_type": "markdown",
      "id": "013ecad9-f80f-477c-b954-494b46a02a07",
      "metadata": {},
      "source": [
        "In this case the RAG extractor is only looking at the top document."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "47aad00b-7013-4f7f-a1b0-02ef269093bf",
      "metadata": {},
      "outputs": [],
      "source": [
        "import { RunnableSequence } from \"@langchain/core/runnables\";\n",
        "\n",
        "const ragExtractor = RunnableSequence.from([\n",
        "  {\n",
        "    text: retriever.pipe((docs) => docs[0].pageContent)\n",
        "  },\n",
        "  extractionChain,\n",
        "]);"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "id": "68f2de01-0cd8-456e-a959-db236189d41b",
      "metadata": {},
      "outputs": [],
      "source": [
        "const ragExtractorResults = await ragExtractor.invoke(\"Key developments associated with cars\");"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "id": "56f434ea-1869-4192-914e-3ccf64e72f75",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[\n",
              "  {\n",
              "    year: \u001b[33m2020\u001b[39m,\n",
              "    description: \u001b[32m\"The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1.\"\u001b[39m... 33 more characters,\n",
              "    evidence: \u001b[32m\"The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2\"\u001b[39m... 31 more characters\n",
              "  },\n",
              "  {\n",
              "    year: \u001b[33m2030\u001b[39m,\n",
              "    description: \u001b[32m\"All fossil fuel vehicles will be banned in Amsterdam from 2030.\"\u001b[39m,\n",
              "    evidence: \u001b[32m\"all fossil fuel vehicles will be banned in Amsterdam from 2030.\"\u001b[39m\n",
              "  },\n",
              "  {\n",
              "    year: \u001b[33m2020\u001b[39m,\n",
              "    description: \u001b[32m\"In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.\"\u001b[39m,\n",
              "    evidence: \u001b[32m\"In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.\"\u001b[39m\n",
              "  }\n",
              "]"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "ragExtractorResults.key_developments;"
      ]
    },
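    {
      "cell_type": "markdown",
      "id": "evidence-check-sketch",
      "metadata": {},
      "source": [
        "Because the schema asks the model to repeat its `evidence` verbatim, you can check each extraction against the text it came from. Here is a minimal sketch, assuming that a case-insensitive, whitespace-insensitive substring match is strict enough. `normalizeText` and `isSupported` are hypothetical helpers, and the sample text is made up for illustration.\n",
        "\n",
        "```typescript\n",
        "// Collapse whitespace and case so trivial formatting differences still match.\n",
        "function normalizeText(value: string): string {\n",
        "  return value.replace(/\\s+/g, \" \").trim().toLowerCase();\n",
        "}\n",
        "\n",
        "// Hypothetical helper: true when the quoted evidence occurs in any source text.\n",
        "function isSupported(evidence: string, sourceTexts: string[]): boolean {\n",
        "  const needle = normalizeText(evidence);\n",
        "  if (needle === \"\") return false;\n",
        "  return sourceTexts.some((text) => normalizeText(text).includes(needle));\n",
        "}\n",
        "\n",
        "// Made-up sample text, not taken from the article.\n",
        "const sampleSources = [\"The first  steam-powered road vehicle was built in 1769.\"];\n",
        "const supportedCheck = isSupported(\"the first steam-powered road vehicle\", sampleSources);\n",
        "const unsupportedCheck = isSupported(\"The first electric car was built in 1769.\", sampleSources);\n",
        "```\n",
        "\n",
        "For the brute force results, this could be applied as `keyDevelopments.filter((item) => isSupported(item.evidence, firstFewTexts))`. A strict match can also reject genuine extractions when the model paraphrases instead of quoting, so treat failures as items to review rather than proof of fabrication."
      ]
    },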
    {
      "cell_type": "markdown",
      "id": "cf36e626-cf5d-4324-ba29-9bd602be9b97",
      "metadata": {},
      "source": [
        "## Common issues\n",
        "\n",
        "Different methods have their own pros and cons related to cost, speed, and accuracy.\n",
        "\n",
        "Watch out for these issues:\n",
        "\n",
        "* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.\n",
        "* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!\n",
        "* LLMs can make up data. If you're looking for a single fact across a large text and using a brute force approach, you may end up with more made-up data, because every chunk gives the model another chance to invent an answer.\n",
        "\n",
        "## Next steps\n",
        "\n",
        "You've now learned how to handle long text when extracting structured data.\n",
        "\n",
        "Next, check out some of the other guides in this section, such as [some tips on how to improve extraction quality with examples](/docs/how_to/extraction_examples)."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Deno",
      "language": "typescript",
      "name": "deno"
    },
    "language_info": {
      "file_extension": ".ts",
      "mimetype": "text/x.typescript",
      "name": "typescript",
      "nb_converter": "script",
      "pygments_lexer": "typescript",
      "version": "5.3.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
